Dual Purpose Hashing
Recent years have seen increasing demand for a unified framework to
address multiple realistic image retrieval tasks concerning both category and
attributes. Considering the scale of modern datasets, hashing is favorable for
its low complexity. However, most existing hashing methods are designed to
preserve a single kind of similarity and are thus ill-suited to handling
different tasks simultaneously. To overcome this limitation, we propose a new
hashing method, named Dual Purpose Hashing (DPH), which jointly preserves the
category and attribute similarities by exploiting Convolutional Neural
Network (CNN) models to hierarchically capture the correlations between
category and attributes. Since images with both category and attribute labels
are scarce, our method is designed to take the abundant partially labelled
images on the Internet as training inputs. With such a framework, the binary
codes of newly arriving images can be readily obtained by quantizing the network
outputs of a binary-like layer, and the attributes can be recovered from the
codes easily. Experiments on two large-scale datasets show that our dual
purpose hash codes can achieve comparable or even better performance than those
state-of-the-art methods specifically designed for each individual retrieval
task, while being more compact than the compared methods.
Comment: With supplementary materials added to the end
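To make the quantization step above concrete, here is a minimal PyTorch sketch of a "binary-like" hashing head trained jointly on category and attribute labels, with each loss applied only where a label is available (the partially labelled setting). The layer sizes, class/attribute counts, and loss combination are illustrative assumptions, not the exact DPH architecture.

```python
# A minimal sketch of the "binary-like layer" idea; all dimensions are assumptions.
import torch
import torch.nn as nn

class DualPurposeHead(nn.Module):
    def __init__(self, feat_dim=2048, code_bits=48, n_categories=100, n_attributes=40):
        super().__init__()
        # Sigmoid keeps activations in (0, 1), i.e. "binary-like".
        self.binary_like = nn.Sequential(nn.Linear(feat_dim, code_bits), nn.Sigmoid())
        self.category_clf = nn.Linear(code_bits, n_categories)
        self.attribute_clf = nn.Linear(code_bits, n_attributes)

    def forward(self, feat):
        h = self.binary_like(feat)
        return h, self.category_clf(h), self.attribute_clf(h)

def hash_codes(head, feat):
    # Quantize the binary-like outputs at 0.5 to obtain the final hash code.
    h, _, _ = head(feat)
    return (h > 0.5).to(torch.uint8)

def joint_loss(cat_logits, attr_logits, cat_label=None, attr_label=None):
    # Partially labelled training: apply each loss only where the label exists.
    loss = cat_logits.new_zeros(())
    if cat_label is not None:
        loss = loss + nn.functional.cross_entropy(cat_logits, cat_label)
    if attr_label is not None:
        loss = loss + nn.functional.binary_cross_entropy_with_logits(attr_logits, attr_label)
    return loss
```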
Weakly Supervised Object Detection with Segmentation Collaboration
Weakly supervised object detection aims at learning precise object detectors,
given image category labels. In recent prevailing works, this problem is
generally formulated as a multiple instance learning module guided by an image
classification loss. The object bounding box is assumed to be the one
contributing most to the classification among all proposals. However, the
region contributing most is also likely to be a crucial part or the supporting
context of an object. To obtain a more accurate detector, in this work we
propose a novel end-to-end weakly supervised detection approach, where a newly
introduced generative adversarial segmentation module interacts with the
conventional detection module in a collaborative loop. The collaboration
mechanism takes full advantage of the complementary interpretations of the
weakly supervised localization task, namely detection and segmentation tasks,
forming a more comprehensive solution. Consequently, our method obtains more
precise object bounding boxes, rather than parts or irrelevant surroundings.
As expected, the proposed method achieves an accuracy of 51.0% on the PASCAL
VOC 2007 dataset, outperforming the state of the art and demonstrating its
effectiveness for weakly supervised object detection.
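The multiple-instance-learning formulation mentioned above can be sketched as follows, in the spirit of WSDDN-style detectors: per-proposal scores are aggregated into an image-level prediction supervised only by the category labels. The two-branch design and dimensions are illustrative assumptions, and this shows only the MIL module, not the paper's segmentation collaboration loop.

```python
# A minimal sketch of MIL detection guided by an image classification loss.
import torch
import torch.nn as nn

class MILDetectionHead(nn.Module):
    def __init__(self, feat_dim=4096, n_classes=20):
        super().__init__()
        self.cls_branch = nn.Linear(feat_dim, n_classes)  # "what" score
        self.det_branch = nn.Linear(feat_dim, n_classes)  # "where" score

    def forward(self, proposal_feats):                    # (n_proposals, feat_dim)
        cls = self.cls_branch(proposal_feats).softmax(dim=1)  # over classes
        det = self.det_branch(proposal_feats).softmax(dim=0)  # over proposals
        proposal_scores = cls * det                       # per-proposal class scores
        # Clamp keeps the aggregated score a valid probability for BCE.
        image_scores = proposal_scores.sum(dim=0).clamp(0, 1)
        return proposal_scores, image_scores

def mil_loss(image_scores, image_labels):                 # image_labels in {0,1}^C
    # Only image-level category labels supervise the detector; at test time the
    # highest-scoring proposal per class serves as the detected box.
    return nn.functional.binary_cross_entropy(image_scores, image_labels)
```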
Fully Learnable Group Convolution for Acceleration of Deep Neural Networks
Benefiting from its great success on many tasks, deep learning is
increasingly deployed on low-computational-cost devices, e.g., smartphones and
embedded devices. To reduce the high computational and memory cost, in this work,
we propose a fully learnable group convolution module (FLGC for short) which is
quite efficient and can be embedded into any deep neural network for
acceleration. Specifically, our proposed method automatically learns the group
structure in the training stage in a fully end-to-end manner, leading to a
better structure than the existing pre-defined, two-step, or iterative
strategies. Moreover, our method can be further combined with depthwise
separable convolution, yielding a 5x acceleration over the vanilla ResNet-50
on a single CPU. An additional advantage is that in FLGC the number of groups
can be set to any value, rather than being restricted to 2^k as in most
existing methods, enabling a better trade-off between accuracy and speed. In
our experiments, our method achieves better performance than existing learnable
group convolution and standard group convolution when using the same number of
groups.
Comment: Accepted by CVPR 2019
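One way to picture learning the group structure end-to-end is the speculative PyTorch sketch below: each input and output channel learns a soft group assignment, which is hardened with a straight-through estimator and used to mask a dense convolution weight. This is a reconstruction of the general idea, not the exact FLGC formulation; the class name and hyperparameters are hypothetical.

```python
# A minimal sketch of a learnable group structure; illustrative, not FLGC itself.
import torch
import torch.nn as nn

class LearnableGroupConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, groups=4):  # groups can be any value
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        # Soft assignment logits of input/output channels to groups.
        self.in_assign = nn.Parameter(torch.randn(in_ch, groups))
        self.out_assign = nn.Parameter(torch.randn(out_ch, groups))

    def group_mask(self):
        a_in = self.in_assign.softmax(dim=1)              # (in_ch, G)
        a_out = self.out_assign.softmax(dim=1)            # (out_ch, G)
        hard_in = nn.functional.one_hot(a_in.argmax(1), a_in.shape[1]).float()
        hard_out = nn.functional.one_hot(a_out.argmax(1), a_out.shape[1]).float()
        soft = a_out @ a_in.t()                           # soft connectivity
        hard = hard_out @ hard_in.t()                     # hard connectivity
        # Straight-through estimator: hard values forward, soft gradients back.
        return hard + soft - soft.detach()

    def forward(self, x):
        # At inference, channels can be permuted so the masked weight becomes a
        # true grouped convolution and actually runs faster.
        mask = self.group_mask()[:, :, None, None]        # (out_ch, in_ch, 1, 1)
        return nn.functional.conv2d(x, self.conv.weight * mask, self.conv.bias,
                                    padding=self.conv.padding)
```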
Learning Expressionlets via Universal Manifold Model for Dynamic Facial Expression Recognition
Facial expression is a temporally dynamic event which can be decomposed into a
set of muscle motions occurring in different facial regions over various time
intervals. For dynamic expression recognition, two key issues, temporal
alignment and semantics-aware dynamic representation, must be taken into
account. In this paper, we attempt to solve both problems via manifold modeling
of videos based on a novel mid-level representation, i.e., the expressionlet.
Specifically, our method contains three key stages: 1)
each expression video clip is characterized as a spatial-temporal manifold
(STM) formed by dense low-level features; 2) a Universal Manifold Model (UMM)
is learned over all low-level features and represented as a set of local modes
to statistically unify all the STMs; 3) the local modes on each STM can be
instantiated by fitting to UMM, and the corresponding expressionlet is
constructed by modeling the variations in each local mode. With the above strategy,
expression videos are naturally aligned both spatially and temporally. To
enhance the discriminative power, the expressionlet-based STM representation is
further processed with discriminant embedding. Our method is evaluated on four
public expression databases, CK+, MMI, Oulu-CASIA, and FERA. In all cases, our
method outperforms the known state-of-the-art by a large margin.
Comment: 12 pages
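A minimal sketch of the UMM idea above: a Gaussian mixture is fitted over low-level features pooled from all videos (the "local modes"), and each video's expressionlets are then instantiated from the per-mode statistics of its own features. Feature extraction and the exact expressionlet statistics are simplified assumptions here.

```python
# Illustrative UMM / expressionlet construction with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_umm(all_features, n_modes=64):
    # all_features: (N, D) low-level descriptors pooled over every video.
    return GaussianMixture(n_components=n_modes, covariance_type="diag").fit(all_features)

def expressionlets(umm, video_features):
    # video_features: (n, D) descriptors of one spatio-temporal manifold (STM).
    assign = umm.predict(video_features)   # instantiate the shared local modes
    reps = []
    for k in range(umm.n_components):
        members = video_features[assign == k]
        # Summarize the variation within each local mode; fall back to the UMM
        # mean when a mode receives no features from this video.
        mu = members.mean(axis=0) if len(members) else umm.means_[k]
        reps.append(mu)
    # Because every video is described against the same modes, the resulting
    # representation is aligned across videos both spatially and temporally.
    return np.concatenate(reps)
```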
Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships
Context is important for accurate visual recognition. In this work we propose
an object detection algorithm that not only considers object visual appearance,
but also makes use of two kinds of context including scene contextual
information and object relationships within a single image. Therefore, object
detection is regarded as both a cognition problem and a reasoning problem when
leveraging such structured information. Specifically, this paper formulates
object detection as a problem of graph structure inference, where given an
image the objects are treated as nodes in a graph and relationships between the
objects are modeled as edges in that graph. To this end, we present the
Structure Inference Network (SIN), a detector that incorporates a graphical
model into a typical detection framework (e.g., Faster R-CNN) in order to
infer object states. Comprehensive experiments on PASCAL VOC and MS COCO
datasets indicate that scene context and object relationships truly improve the
performance of object detection with more desirable and reasonable outputs.
Comment: Published in CVPR 2018
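The graph-structure inference described above can be sketched as iterative message passing: detected objects are nodes whose states are refined from the scene context and from messages sent by related objects. The GRU-based update and edge weighting below are illustrative assumptions, not the exact SIN design.

```python
# A minimal sketch of scene-and-relationship message passing over detections.
import torch
import torch.nn as nn

class StructureInference(nn.Module):
    def __init__(self, dim=512, steps=2):
        super().__init__()
        self.steps = steps
        self.scene_gru = nn.GRUCell(dim, dim)   # integrates scene-level context
        self.edge_gru = nn.GRUCell(dim, dim)    # integrates object messages
        self.edge_score = nn.Linear(2 * dim, 1)

    def forward(self, node_feats, scene_feat):  # (N, dim), (dim,)
        h = node_feats
        scene = scene_feat.expand_as(h)
        for _ in range(self.steps):
            n = h.shape[0]
            # Weight each directed edge by the compatibility of its endpoints.
            pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                               h.unsqueeze(0).expand(n, n, -1)], dim=-1)
            w = self.edge_score(pairs).squeeze(-1).softmax(dim=1)   # (N, N)
            msg = w @ h                         # aggregate messages per node
            h = self.scene_gru(scene, h)        # update from scene context
            h = self.edge_gru(msg, h)           # update from relationships
        return h  # refined object states fed to the classifier/regressor heads
```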
Pose-adaptive Hierarchical Attention Network for Facial Expression Recognition
Multi-view facial expression recognition (FER) is a challenging task because
the appearance of an expression varies with pose. To alleviate the influence
of pose, recent methods either perform pose normalization or learn separate FER
classifiers for each pose. However, these methods usually have two stages and
rely on the good performance of pose estimators. Different from existing methods,
we propose a pose-adaptive hierarchical attention network (PhaNet) that can
jointly recognize facial expressions and poses in unconstrained environments.
Specifically, PhaNet discovers the regions most relevant to the
facial expression by an attention mechanism in hierarchical scales, and the
most informative scales are then selected to learn the pose-invariant and
expression-discriminative representations. PhaNet is end-to-end trainable by
minimizing the hierarchical attention losses, the FER loss and pose loss with
dynamically learned loss weights. We validate the effectiveness of the proposed
PhaNet on three multi-view datasets (BU-3DFE, Multi-pie, and KDEF) and two
in-the-wild FER datasets (AffectNet and SFEW). Extensive experiments
demonstrate that our framework outperforms the state of the art under both
within-dataset and cross-dataset settings, achieving average accuracies of
84.92%, 93.53%, 88.5%, 54.82%, and 31.25%, respectively.
Comment: 12 pages, 15 figures
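Minimizing several losses with dynamically learned weights, as mentioned above, can be sketched as follows. The uncertainty-based weighting of Kendall et al. used here is one common way to learn such weights; the paper's exact scheme may differ.

```python
# A minimal sketch of dynamically learned multi-task loss weights.
import torch
import torch.nn as nn

class DynamicLossWeights(nn.Module):
    def __init__(self, n_losses=3):
        super().__init__()
        # One learnable log-variance per task, trained with the network.
        self.log_vars = nn.Parameter(torch.zeros(n_losses))

    def forward(self, losses):  # iterable of scalar task losses
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])   # weight = 1 / sigma^2
            # The additive log-variance term keeps weights from collapsing to 0.
            total = total + precision * loss + self.log_vars[i]
        return total

# usage sketch: total = weighter([attention_loss, fer_loss, pose_loss])
```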
Learning Mid-level Words on Riemannian Manifold for Action Recognition
Human action recognition remains a challenging task due to the various
sources of video data and large intra-class variations. It thus becomes one of
the key issues in recent research to explore effective and robust
representation to handle such challenges. In this paper, we propose a novel
representation approach by constructing mid-level words in videos and encoding
them on Riemannian manifold. Specifically, we first conduct a global alignment
on the densely extracted low-level features to build a bank of corresponding
feature groups, each of which can be statistically modeled as a mid-level word
lying on some specific Riemannian manifold. Based on these mid-level words, we
construct intrinsic Riemannian codebooks by employing K-Karcher-means
clustering and a Riemannian Gaussian Mixture Model, and consequently derive
Riemannian manifold versions of three well-studied encoding methods from
Euclidean space, i.e., Bag of Visual Words (BoVW), Vector of Locally Aggregated
Descriptors (VLAD), and Fisher Vector (FV), to obtain the final action video
representations. Our method is evaluated on two tasks over four popular realistic
datasets: action recognition on YouTube, UCF50, HMDB51 databases, and action
similarity labeling on the ASLAN database. In all cases, the reported results
are highly competitive with the most recent state-of-the-art works.
Comment: 10 pages
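The K-Karcher-means clustering above needs the Karcher (Fréchet) mean of points on a Riemannian manifold. Below is a minimal sketch for the SPD-matrix manifold under the affine-invariant metric, assuming the mid-level words are covariance-type descriptors (a common choice; the paper's specific manifolds may differ).

```python
# Karcher mean of SPD matrices via tangent-space averaging.
import numpy as np
from scipy.linalg import sqrtm, logm, expm, inv

def karcher_mean(spd_mats, iters=10):
    mean = np.mean(spd_mats, axis=0)            # Euclidean initialization
    for _ in range(iters):
        root = sqrtm(mean).real
        root_inv = inv(root)
        # Map every point to the tangent space at the current mean, average
        # there, then map the average back onto the manifold.
        tangent = np.mean([logm(root_inv @ x @ root_inv).real for x in spd_mats],
                          axis=0)
        mean = root @ expm(tangent) @ root
        if np.linalg.norm(tangent) < 1e-6:      # converged
            break
    return mean
```

K-Karcher-means then alternates exactly like Euclidean k-means, but with this mean replacing the arithmetic centroid and the affine-invariant geodesic distance replacing the Euclidean one.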
Learning Class Prototypes via Structure Alignment for Zero-Shot Recognition
Zero-shot learning (ZSL) aims to recognize objects of novel classes without
any training samples of those classes, which is achieved by exploiting
semantic information and auxiliary datasets. Recently, most ZSL approaches focus
on learning visual-semantic embeddings to transfer knowledge from the auxiliary
datasets to the novel classes. However, few works study whether the semantic
information is discriminative enough for the recognition task. To tackle this
problem, we propose a coupled dictionary learning approach to align the
visual-semantic structures using the class prototypes, where the discriminative
information lying in the visual space is utilized to improve the less
discriminative semantic space. Then, zero-shot recognition can be performed in
different spaces by the simple nearest neighbor approach using the learned
class prototypes. Extensive experiments on four benchmark datasets show the
effectiveness of the proposed approach.
Comment: To appear in ECCV 2018
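The final recognition step described above reduces to nearest-neighbor search: once class prototypes are learned (here assumed given), a test sample is labeled by its nearest prototype. Cosine similarity is an illustrative choice of metric.

```python
# A minimal sketch of zero-shot prediction with learned class prototypes.
import numpy as np

def zero_shot_predict(x, prototypes):
    # x: (D,) embedded test sample; prototypes: (C, D) learned class prototypes,
    # including the novel classes that have no training samples.
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    x = x / np.linalg.norm(x)
    return int(np.argmax(p @ x))  # index of the most similar class prototype
```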
VIPL-HR: A Multi-modal Database for Pulse Estimation from Less-constrained Face Video
Heart rate (HR) is an important physiological signal that reflects the
physical and emotional activities of humans. Traditional HR measurements are
mainly based on contact monitors, which are inconvenient and may cause
discomfort for the subjects. Recently, methods have been proposed for remote HR
estimation from face videos. However, most of the existing methods focus on
well-controlled scenarios, and their ability to generalize to less-constrained
scenarios is unknown. At the same time, the lack of large-scale databases has
limited the use of deep representation learning methods in remote HR
estimation. In this paper, we introduce a large-scale multi-modal HR database
(named VIPL-HR), which contains 2,378 visible-light (VIS) videos and 752
near-infrared (NIR) videos of 107 subjects. Our VIPL-HR database also covers
various conditions such as head movements, illumination variations, and
acquisition device changes. We also learn a deep HR estimator (named
RhythmNet) with the proposed spatial-temporal representation, which achieves
promising results on both the public-domain and our VIPL-HR HR estimation
databases. We would like to put the VIPL-HR database into the public domain.
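For context, the classical remote HR pipeline that databases like VIPL-HR are used to evaluate can be sketched in a few lines: average a face-region color channel over time, band-pass to the plausible HR range, and read off the dominant frequency. This is a simple hand-crafted baseline, not the RhythmNet estimator from the paper.

```python
# A minimal rPPG-style HR baseline from a face-ROI color trace.
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_hr(green_trace, fps=30.0):
    # green_trace: mean green-channel value of the face ROI, one value per frame.
    sig = green_trace - np.mean(green_trace)
    # Band-pass 0.7-4.0 Hz, i.e. roughly 42-240 beats per minute.
    b, a = butter(3, [0.7 / (fps / 2), 4.0 / (fps / 2)], btype="band")
    sig = filtfilt(b, a, sig)
    spectrum = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fps)
    return 60.0 * freqs[np.argmax(spectrum)]  # dominant frequency, in bpm
```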
AttGAN: Facial Attribute Editing by Only Changing What You Want
Facial attribute editing aims to manipulate single or multiple attributes of
a face image, i.e., to generate a new face with desired attributes while
preserving other details. Recently, generative adversarial networks (GANs)
and the encoder-decoder architecture have commonly been combined to handle
this task with
promising results. Based on the encoder-decoder architecture, facial attribute
editing is achieved by decoding the latent representation of the given face
conditioned on the desired attributes. Some existing methods attempt to
establish an attribute-independent latent representation for further attribute
editing. However, such attribute-independent constraint on the latent
representation is excessive because it restricts the capacity of the latent
representation and may result in information loss, leading to over-smooth and
distorted generation. Instead of imposing constraints on the latent
representation, in this work we apply an attribute classification constraint to
the generated image to just guarantee the correct change of desired attributes,
i.e., to "change what you want". Meanwhile, reconstruction learning is
introduced to preserve attribute-excluding details, in other words, to "only
change what you want". Besides, adversarial learning is employed for
visually realistic editing. These three components cooperate with each other
forming an effective framework for high-quality facial attribute editing,
referred to as AttGAN. Furthermore, our method is directly applicable to
attribute intensity control and can be naturally extended to attribute style
manipulation. Experiments on the CelebA dataset show that our method
outperforms the state of the art on realistic attribute editing with facial
details well preserved.
Comment: Submitted to IEEE Transactions on Image Processing. Code:
https://github.com/LynnHo/AttGAN-Tensorflow
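How the three components cooperate can be sketched as a single generator objective: attribute classification on the edited image, reconstruction of the input under its own attributes, and an adversarial realism term. The network modules and the lambda weights below are illustrative assumptions, not the paper's exact settings.

```python
# A minimal sketch of the AttGAN-style generator loss composition.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def attgan_generator_loss(enc, dec, clf, dis, x, attrs, target_attrs,
                          lambda_rec=100.0, lambda_cls=10.0):
    z = enc(x)                       # latent representation, left unconstrained
    x_edit = dec(z, target_attrs)    # decode under the desired attributes
    x_rec = dec(z, attrs)            # decode under the original attributes
    # 1) attribute classification: the edit must carry the desired attributes.
    loss_cls = bce(clf(x_edit), target_attrs)
    # 2) reconstruction: preserve everything the attributes do not specify.
    loss_rec = l1(x_rec, x)
    # 3) adversarial: the edited face should look realistic to the critic.
    d_out = dis(x_edit)
    loss_adv = bce(d_out, torch.ones_like(d_out))
    return lambda_cls * loss_cls + lambda_rec * loss_rec + loss_adv
```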